Vision-based Active Speaker Detection in Multiparty Interactions
نویسندگان
چکیده
This paper presents a supervised learning method for automatic visual detection of the active speaker in multiparty interactions. The presented detectors are built using a multimodal multiparty interaction dataset previously recorded with the purpose to explore patterns in the focus of visual attention of humans. Three different conditions are included: two humans involved in taskbased interaction with a robot; the same two humans involved in task-based interaction where the robot is replaced by a third human, and a free three-party human interaction. The paper also presents an evaluation of the active speaker detection method in a speaker dependent experiment showing that the method achieves good accuracy rates in a fairly unconstrained scenario using only image data as input. The main goal of the presented method is to provide real-time detection of the active speaker within a broader framework implemented on a robot and used to generate natural focus of visual attention behavior during multiparty human-robot interactions.
منابع مشابه
Towards Speaker Detection using Lips Movements for Human-Machine Multiparty Dialogue
This paper explores the use of lips movements for the purpose of speaker and voice activity detection, a task that is essential in multi-modal multiparty human machine dialogue. The task aims at detecting who and when someone is speaking out of a set of persons. A multiparty dialogue consisting of 4 speakers is audiovisually recorded and then annotated for speaker and speech/silence segments. L...
متن کاملAutomatic social role recognition and its application in structuring multiparty interactions
Automatic processing of multiparty interactions is a research domain with important applications in content browsing, summarization and information retrieval. In recent years, several works have been devoted to find regular patterns which speakers exhibit in a multiparty interaction also known as social roles. Most of the research in literature has generally focused on recognition of scenario s...
متن کاملFloor holder detection and end of speaker turn prediction in meetings
We propose a novel fully automatic framework to detect which meeting participant is currently holding the conversational floor and when the current speaker turn is going to finish. Two sets of experiments were conducted on a large collection of multiparty conversations: the AMI meeting corpus. Unsupervised speaker turn detection was performed by post-processing the speaker diarization and the s...
متن کاملCross-Modal Supervision for Learning Active Speaker Detection in Video
In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion facial expressions and gesticulations associated with speaking. We further improve a generic model for acti...
متن کاملSpatio-Temporal Analysis of Spontaneous Speech with Microphone Arrays
Accurate detection, localization and tracking of multiple moving speakers permits a wide spectrum of applications. Techniques are required that are versatile, robust to environmental variations, and not constraining for non-technical end-users. Based on distant recording of spontaneous multiparty conversations, this thesis focuses on the use of microphone arrays to address the question “Who spo...
متن کامل